Can We Evaluate Domain Adaptation Models Without Target-Domain Labels? A Metric for Unsupervised Evaluation of Domain Adaptation
Unsupervised domain adaptation (UDA) involves adapting a model trained on a
label-rich source domain to an unlabeled target domain. However, in real-world
scenarios, the absence of target-domain labels makes it challenging to evaluate
the performance of deep models after UDA. Additionally, prevailing UDA methods
typically rely on adversarial training and self-training, which could lead to
model degeneration and negative transfer, further exacerbating the evaluation
problem. In this paper, we propose a novel metric called the \textit{Transfer
Score} to address these issues. The transfer score enables the unsupervised
evaluation of domain adaptation models by assessing the spatial uniformity of
the classifier via model parameters, as well as the transferability and
discriminability of the feature space. Based on unsupervised evaluation using
our metric, we achieve three goals: (1) selecting the most suitable UDA method
from a range of available options, (2) optimizing hyperparameters of UDA models
to prevent model degeneration, and (3) identifying the epoch at which the
adapted model performs optimally. Our work bridges the gap between UDA research
and practical UDA evaluation, enabling a realistic assessment of UDA model
performance. We validate the effectiveness of our metric through extensive
empirical studies conducted on various public datasets. The results demonstrate
the utility of the transfer score in evaluating UDA models and its potential to
enhance the overall efficacy of UDA techniques.
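The abstract names the two ingredients of the Transfer Score (uniformity of the classifier measured from model parameters, plus discriminability of the feature space) but not their formulas. The sketch below is a hypothetical assembly of such an unsupervised score; every function here is an illustrative assumption, not the paper's actual metric.

```python
import numpy as np

def classifier_uniformity(W):
    """Hypothetical uniformity term: how evenly the classifier weight
    vectors (one row per class) are spread on the unit sphere, taken as
    the negative mean pairwise cosine similarity."""
    W = W / np.linalg.norm(W, axis=1, keepdims=True)
    cos = W @ W.T
    n = W.shape[0]
    off_diag = cos[~np.eye(n, dtype=bool)]
    return -off_diag.mean()  # larger = more uniformly spread

def feature_discriminability(feats, pseudo_labels):
    """Hypothetical discriminability term: ratio of between-cluster to
    within-cluster scatter under the model's own pseudo labels (no
    target-domain labels needed)."""
    mu = feats.mean(axis=0)
    between, within = 0.0, 0.0
    for c in np.unique(pseudo_labels):
        fc = feats[pseudo_labels == c]
        between += len(fc) * np.sum((fc.mean(axis=0) - mu) ** 2)
        within += np.sum((fc - fc.mean(axis=0)) ** 2)
    return between / (within + 1e-8)

def transfer_score(W, feats, pseudo_labels, alpha=0.5):
    """Combine the two unsupervised terms; such a score could rank
    checkpoints or hyperparameter settings without target labels."""
    return (alpha * classifier_uniformity(W)
            + (1 - alpha) * feature_discriminability(feats, pseudo_labels))
```

A well-adapted model (tight, separated target clusters) should score higher than a degenerated one, which is what makes checkpoint selection possible.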
Confidence Attention and Generalization Enhanced Distillation for Continuous Video Domain Adaptation
Continuous Video Domain Adaptation (CVDA) is a scenario where a source model
is required to adapt to a series of individually available changing target
domains continuously without source data or target supervision. It has wide
applications, such as robotic vision and autonomous driving. The main
underlying challenge of CVDA is to learn helpful information from the
unsupervised target data alone while avoiding catastrophic forgetting of
previously learned knowledge, which is beyond the capability of previous
Video-based Unsupervised Domain Adaptation methods. Therefore, we propose a
Confidence-Attentive network with geneRalization enhanced self-knowledge
disTillation (CART) to address the challenge in CVDA. Firstly, to learn from
unsupervised domains, we propose to learn from pseudo labels. However, in
continuous adaptation, prediction errors can accumulate rapidly in pseudo
labels, and CART effectively tackles this problem with two key modules.
Specifically, the first module generates refined pseudo labels using model
predictions and deploys a novel attentive learning strategy. The second module
compares the outputs of augmented data from the current model to the outputs of
weakly augmented data from the source model, forming a novel consistency
regularization on the model to alleviate the accumulation of prediction errors.
Extensive experiments suggest that CART outperforms existing CVDA methods by a
considerable margin. Comment: 16 pages, 9 tables, 10 figures.
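The two CART modules are described only at a high level. A hypothetical numpy sketch of the two ideas, confidence-based pseudo-label refinement and a consistency term comparing the current model's strongly-augmented outputs against the source model's weakly-augmented outputs; the threshold and loss form here are assumptions, not the paper's implementation.

```python
import numpy as np

def refine_pseudo_labels(probs, threshold=0.9):
    """Hypothetical refinement step: keep only predictions whose
    confidence exceeds a threshold; masked-out samples do not
    contribute to the adaptation loss."""
    conf = probs.max(axis=1)
    labels = probs.argmax(axis=1)
    mask = conf >= threshold
    return labels, mask

def consistency_loss(current_probs_strong, source_probs_weak):
    """Hypothetical consistency regularizer: cross-entropy between the
    source model's weakly-augmented predictions (soft targets) and the
    current model's strongly-augmented predictions, discouraging the
    accumulation of prediction errors during continuous adaptation."""
    eps = 1e-8
    return -np.mean(np.sum(source_probs_weak
                           * np.log(current_probs_strong + eps), axis=1))
```

The loss is smallest when the adapting model agrees with the frozen source model on the same clip, which is how drift away from previously learned knowledge gets penalized.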
Effective Action Recognition with Embedded Key Point Shifts
Temporal feature extraction is an essential technique in video-based action
recognition. Key points have been utilized in skeleton-based action recognition
methods but they require costly key point annotation. In this paper, we propose
a novel temporal feature extraction module, named the Key Point Shifts
Embedding Module, to adaptively extract channel-wise key point shifts across
video frames without requiring key point annotation. Key
points are adaptively extracted as feature points with maximum feature values
at split regions, while key point shifts are the spatial displacements of
corresponding key points. The key point shifts are encoded as the overall
temporal features via linear embedding layers in a multi-set manner. Our method
embeds key point shifts at trivial computational cost, achieving
state-of-the-art performance of 82.05% on Mini-Kinetics and competitive
performance on the UCF101, Something-Something-v1, and HMDB51
datasets. Comment: 35 pages, 10 figures.
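The extraction step described above (the maximum feature value in each split region is a key point; shifts are the displacements of corresponding key points between frames) can be sketched directly. The grid size and array layout below are illustrative assumptions.

```python
import numpy as np

def key_point_shifts(frames, grid=2):
    """Hypothetical sketch of channel-wise key point shifts: split each
    feature map into a grid of regions, take the location of the maximum
    activation in each region as a key point, and record its spatial
    displacement between consecutive frames."""
    T, C, H, W = frames.shape
    rh, rw = H // grid, W // grid
    # key point coordinates per frame, channel, and region: (T, C, grid*grid, 2)
    pts = np.zeros((T, C, grid * grid, 2))
    for t in range(T):
        for c in range(C):
            for i in range(grid):
                for j in range(grid):
                    region = frames[t, c, i*rh:(i+1)*rh, j*rw:(j+1)*rw]
                    y, x = np.unravel_index(region.argmax(), region.shape)
                    pts[t, c, i * grid + j] = (i * rh + y, j * rw + x)
    # shifts between consecutive frames: (T-1, C, grid*grid, 2)
    return pts[1:] - pts[:-1]
```

In the paper these shifts would then be fed to linear embedding layers; here the function stops at the raw displacements.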
Leveraging Endo- and Exo-Temporal Regularization for Black-box Video Domain Adaptation
To enable video models to be applied seamlessly across video tasks in
different environments, various Video Unsupervised Domain Adaptation (VUDA)
methods have been proposed to improve the robustness and transferability of
video models. Despite improvements made in model robustness, these VUDA methods
require access to both source data and source model parameters for adaptation,
raising serious data privacy and model portability issues. To cope with the
above concerns, this paper first formulates Black-box Video Domain Adaptation
(BVDA) as a more realistic yet challenging scenario where the source video
model is provided only as a black-box predictor. While a few methods for
Black-box Domain Adaptation (BDA) have been proposed in the image domain, they
cannot be directly applied to the video domain, since the video modality has
more complicated temporal features that are harder to align. To address BVDA,
we propose a novel Endo and
eXo-TEmporal Regularized Network (EXTERN) by applying mask-to-mix strategies
and video-tailored regularizations: endo-temporal regularization and
exo-temporal regularization, performed across both clip and temporal features,
while distilling knowledge from the predictions obtained from the black-box
predictor. Empirical results demonstrate the state-of-the-art performance of
EXTERN across various cross-domain closed-set and partial-set action
recognition benchmarks, even surpassing most existing video domain adaptation
methods that have access to source data. Comment: 9 pages, 4 figures, and 4
tables.
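In the black-box setting only the source model's output probabilities are available, so the distillation part of the recipe reduces to matching those outputs. A hedged sketch follows; applying the temperature to probabilities rather than logits is an assumption forced by the black-box constraint, not necessarily EXTERN's formulation.

```python
import numpy as np

def distill_loss(student_probs, blackbox_probs, temperature=1.0):
    """Hypothetical knowledge-distillation term for black-box adaptation:
    the source model exposes only output probabilities (no weights, no
    source data), so the student matches them via KL divergence.
    Temperature sharpens or flattens the black-box targets."""
    eps = 1e-8
    t = blackbox_probs ** (1.0 / temperature)
    t = t / t.sum(axis=1, keepdims=True)
    return np.mean(np.sum(t * (np.log(t + eps)
                               - np.log(student_probs + eps)), axis=1))
```

The endo- and exo-temporal regularizers of the paper would be added on top of this term; they are not reproducible from the abstract alone.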
Fully-Connected Spatial-Temporal Graph for Multivariate Time Series Data
Multivariate Time-Series (MTS) data is crucial in various application fields.
With its sequential and multi-source (multiple sensors) properties, MTS data
inherently exhibits Spatial-Temporal (ST) dependencies, involving temporal
correlations between timestamps and spatial correlations between sensors in
each timestamp. To effectively leverage this information, Graph Neural
Network-based methods (GNNs) have been widely adopted. However, existing
approaches separately capture spatial dependency and temporal dependency and
fail to capture the correlations between Different sEnsors at Different
Timestamps (DEDT). Overlooking such correlations hinders the comprehensive
modelling of ST dependencies within MTS data, thus restricting existing GNNs
from learning effective representations. To address this limitation, we propose
a novel method called Fully-Connected Spatial-Temporal Graph Neural Network
(FC-STGNN), including two key components namely FC graph construction and FC
graph convolution. For graph construction, we design a decay graph to connect
sensors across all timestamps based on their temporal distances, enabling us to
fully model the ST dependencies by considering the correlations between DEDT.
Further, we devise FC graph convolution with a moving-pooling GNN layer to
effectively capture the ST dependencies for learning effective representations.
Extensive experiments show the effectiveness of FC-STGNN on multiple MTS
datasets compared to SOTA methods. Comment: 9 pages, 8 figures.
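The decay-graph idea, connecting every sensor to every sensor across all timestamps with edge weights that shrink with temporal distance, can be sketched as follows. The exponential decay form and the use of a single per-timestamp sensor correlation matrix are assumptions for illustration.

```python
import numpy as np

def decay_graph(corr, n_timestamps, decay=0.7):
    """Hypothetical fully-connected decay graph: replicate a sensor
    correlation matrix across all timestamp pairs, scaling each block
    down exponentially with the temporal distance |t1 - t2|, so that
    correlations between Different sEnsors at Different Timestamps
    (DEDT) are explicitly represented."""
    n = corr.shape[0]
    A = np.zeros((n * n_timestamps, n * n_timestamps))
    for t1 in range(n_timestamps):
        for t2 in range(n_timestamps):
            w = decay ** abs(t1 - t2)
            A[t1*n:(t1+1)*n, t2*n:(t2+1)*n] = w * corr
    return A
```

The resulting (n_sensors x n_timestamps)-node adjacency is what a fully-connected graph convolution would then operate on.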
Graph Contextual Contrasting for Multivariate Time Series Classification
Contrastive learning, as a self-supervised learning paradigm, has become popular
for Multivariate Time-Series (MTS) classification. It ensures the consistency
across different views of unlabeled samples and then learns effective
representations for these samples. Existing contrastive learning methods mainly
focus on achieving temporal consistency with temporal augmentation and
contrasting techniques, aiming to preserve temporal patterns against
perturbations for MTS data. However, they overlook spatial consistency that
requires the stability of individual sensors and their correlations. As MTS
data typically originate from multiple sensors, ensuring spatial consistency
becomes essential for the overall performance of contrastive learning on MTS
data. Thus, we propose Graph Contextual Contrasting (GCC) for spatial
consistency across MTS data. Specifically, we propose graph augmentations
including node and edge augmentations to preserve the stability of sensors and
their correlations, followed by graph contrasting with both node- and
graph-level contrasting to extract robust sensor- and global-level features. We
further introduce multi-window temporal contrasting to ensure temporal
consistency in the data for each sensor. Extensive experiments demonstrate that
our proposed GCC achieves state-of-the-art performance on various MTS
classification tasks. Comment: 9 pages, 5 figures.
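A minimal sketch of the node and edge augmentations plus a node-level contrastive objective; the masking probabilities and the NT-Xent-style loss are generic contrastive-learning choices, not necessarily GCC's exact design.

```python
import numpy as np

def augment_graph(node_feats, adj, node_mask_p=0.1, edge_drop_p=0.1, rng=None):
    """Hypothetical graph augmentations: zero-mask a fraction of sensor
    (node) features and drop a fraction of edges, producing a perturbed
    view for graph contrasting."""
    rng = rng or np.random.default_rng()
    feats = node_feats.copy()
    mask = rng.random(feats.shape[0]) < node_mask_p
    feats[mask] = 0.0
    adj2 = adj * (rng.random(adj.shape) >= edge_drop_p)
    return feats, adj2

def nt_xent(z1, z2, tau=0.5):
    """Minimal NT-Xent-style contrastive loss between two views:
    matching rows are positive pairs, all other rows are negatives."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = z1 @ z2.T / tau
    # softmax cross-entropy with the diagonal as the positive pair
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return np.mean(logsumexp - np.diag(sim))
```

Applying the same loss at the pooled graph level would give the global-level contrast the abstract mentions.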
MM-Fi: Multi-Modal Non-Intrusive 4D Human Dataset for Versatile Wireless Sensing
4D human perception plays an essential role in a myriad of applications, such
as home automation and metaverse avatar simulation. However, existing solutions
which mainly rely on cameras and wearable devices are either privacy intrusive
or inconvenient to use. To address these issues, wireless sensing has emerged
as a promising alternative, leveraging LiDAR, mmWave radar, and WiFi signals
for device-free human sensing. In this paper, we propose MM-Fi, the first
multi-modal non-intrusive 4D human dataset with 27 daily or rehabilitation
action categories, to bridge the gap between wireless sensing and high-level
human perception tasks. MM-Fi consists of over 320k synchronized frames of five
modalities from 40 human subjects. Various annotations are provided to support
potential sensing tasks, e.g., human pose estimation and action recognition.
Extensive experiments have been conducted to compare the sensing capacity of
each or several modalities in terms of multiple tasks. We envision that MM-Fi
can contribute to wireless sensing research with respect to action recognition,
human pose estimation, multi-modal learning, cross-modal supervision, and
interdisciplinary healthcare research. Comment: The paper has been accepted by
the NeurIPS 2023 Datasets and Benchmarks Track. Project page:
https://ntu-aiot-lab.github.io/mm-f
Pollen source areas of lakes with inflowing rivers: modern pollen influx data from Lake Baiyangdian, China
Comparing pollen influx recorded in traps above the surface and below the surface of Lake Baiyangdian in northern China shows that the average pollen influx in the traps above the surface, at 1210 grains cm⁻² a⁻¹ (varying from 550 to 2770 grains cm⁻² a⁻¹), is much lower than in the traps below the surface, which average 8990 grains cm⁻² a⁻¹ (ranging from 430 to 22310 grains cm⁻² a⁻¹). This suggests that about 12% of the total pollen influx is transported by air, and 88% via inflowing water. If hydrophyte pollen types are not included, the mean pollen influx in the traps above the surface decreases to 470 grains cm⁻² a⁻¹ (varying from 170 to 910 grains cm⁻² a⁻¹) and to 5470 grains cm⁻² a⁻¹ in the traps below the surface (ranging from 270 to 12820 grains cm⁻² a⁻¹), suggesting that the contribution of waterborne pollen to the non-hydrophyte pollen assemblages in Lake Baiyangdian is about 92%. When trap assemblages are compared with sediment–water interface samples from the same location, the differences between pollen assemblages collected using different methods are more significant than the differences between assemblages collected at different sample sites in the lake using the same trapping method. We compare the ratios of terrestrial to aquicolous pollen types (T/A) between traps in the water and aerial traps, and examine pollen assemblages to determine the proportions of long-distance taxa (i.e. those known to grow only beyond the estimated aerial source radius). These data suggest that the pollen source area of this lake is composed of three parts: an aerial component mainly carried by wind, a fluvial catchment component transported by rivers, and another waterborne component transported by surface wash.
Where the overall vegetation composition within the ‘aerial catchment’ is different from that of the hydrological catchment, the ratio between aerial and waterborne pollen influx offers a method for estimating the relative importance of these two sources, and therefore a starting point for defining a pollen source area for a lake with inflowing rivers.
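The reported percentages follow directly from the mean influx values and can be checked with a few lines of arithmetic.

```python
# Worked check of the shares reported in the abstract: the airborne vs
# waterborne split is just the ratio of the mean influx values.
above_all, below_all = 1210, 8990   # grains cm^-2 a^-1, all taxa
above_dry, below_dry = 470, 5470    # hydrophyte pollen excluded

airborne_share = above_all / (above_all + below_all)
waterborne_dry_share = below_dry / (above_dry + below_dry)

print(round(airborne_share * 100))         # ~12% of total influx by air
print(round(waterborne_dry_share * 100))   # ~92% of non-hydrophyte pollen waterborne
```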